Adding Agentic Retrieval as a new retrieval mode #2018
Greptile Summary
This PR introduces an opt-in agentic retrieval mode (
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/agentic/retrieval.py | New file implementing the core agentic retrieval pipeline (AgenticRetrievalConfig, AgenticRetriever, evaluation helpers). Solid overall; the lock-serialized _retrieve_for_agent is now documented. Minor: num_concurrent is not validated in `__post_init__`. |
| nemo_retriever/src/nemo_retriever/agentic/__init__.py | Public `__all__` now includes all six exported symbols, including run_agentic_beir_evaluation; addresses the prior comment about missing exports. |
| nemo_retriever/src/nemo_retriever/graph/react_agent_operator.py | Adds reasoning_effort, backend_top_k cap, _validate_final_results_args, concurrent output ordering fix, and extensive INFO-level step logging. The INFO log for each ReAct loop step will produce up to 6 x max_steps log lines per query, which can be very noisy with the default 50-step limit. |
| nemo_retriever/src/nemo_retriever/graph/selection_agent_operator.py | Adds source-priority fallback chain (final_results -> RRF -> LLM selection -> candidate_ranking), reasoning_effort forwarding, and result_source tracking. LLM selection is now effectively bypassed when rrf_score column is present (always the case in the AgenticRetriever pipeline), making it a last-resort-only path. |
| nemo_retriever/src/nemo_retriever/graph/rrf_aggregator_operator.py | Adds react_final_rank tracking and has_valid_final_results propagation to the RRF output schema; clean and well-scoped change. |
| nemo_retriever/src/nemo_retriever/pipeline/__main__.py | Adds --retrieval-mode, 6 agentic CLI flags, and _run_agentic_evaluation(). The agentic LLM always uses remote_api_key with no --agentic-api-key override flag. Unbounded _qrels logging at INFO could produce very large log lines on big BEIR datasets. |
| nemo_retriever/tests/test_agentic_eval.py | New test file with 9 tests covering config validation, BEIR/recall evaluation, CLI flag wiring, and error paths. All external services are mocked. |
| nemo_retriever/tests/test_agentic_operators.py | Adds 8 new operator-level tests covering backend_top_k cap, final_results validation, concurrent ordering fix, RRF priority bypass, and fallback behavior. Comprehensive and well-named. |
| nemo_retriever/tests/test_graph_pipeline_cli.py | Renames one test to reflect the now-valid recall evaluation mode. Minimal, correct update. |
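
The source-priority fallback chain described for `selection_agent_operator.py` can be sketched as follows. This is a minimal illustration, not the operator's actual code: the function name `pick_result_source`, the label strings, and the dict-based input are assumptions made for the example. It shows why LLM selection becomes a last-resort path whenever an RRF ranking is present.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical source labels, ordered from highest to lowest priority.
SOURCE_PRIORITY = ("final_results", "rrf", "llm_selection", "candidate_ranking")


def pick_result_source(
    candidates: Dict[str, Optional[Sequence[str]]],
) -> Tuple[str, List[str]]:
    """Return the first non-empty ranking in priority order, with its label."""
    for source in SOURCE_PRIORITY:
        doc_ids = candidates.get(source)
        if doc_ids:  # skips both None and empty sequences
            return source, list(doc_ids)
    return "none", []


# No valid final_results, but an RRF ranking exists, so RRF wins and the
# LLM selection output is never consulted.
source, doc_ids = pick_result_source(
    {
        "final_results": None,
        "rrf": ["d3", "d1"],
        "llm_selection": ["d1"],
        "candidate_ranking": ["d1", "d3"],
    }
)
print(source, doc_ids)
```

Returning the label alongside the ranking mirrors the `result_source` tracking mentioned in the table, so downstream metrics can be split by which path produced the answer.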
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
CLI["--retrieval-mode agentic"] --> AE["_run_agentic_evaluation()"]
AE --> ARC["AgenticRetrievalConfig"]
ARC --> AR["AgenticRetriever.retrieve()"]
AR --> AQI["AgenticQueryInputOperator"]
AQI --> REACT["ReActAgentOperator (max_steps=50)"]
REACT --> RETR["_retrieve_for_agent() via _lock"]
RETR --> VDB[(VectorDB)]
REACT --> RRF["RRFAggregatorOperator k=60"]
RRF --> SAO["SelectionAgentOperator"]
SAO --> P1["1 final_results"]
SAO --> P2["2 RRF ranking"]
SAO --> P3["3 LLM selection"]
SAO --> P4["4 candidate_ranking"]
SAO --> OUT["pd.DataFrame result"]
OUT --> METRICS["compute_beir_metrics()"]
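
The `RRFAggregatorOperator k=60` node in the flowchart refers to reciprocal rank fusion. A minimal sketch of the standard formula, score(d) = sum over rankings of 1 / (k + rank), is shown below. The function name `rrf_fuse` and the list-of-lists input are assumptions for illustration; the operator in this PR works on DataFrames and tracks extra columns.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked doc-id lists with reciprocal rank fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based, so the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Three ReAct steps, each returning its own ranking for a sub-query.
step_rankings = [["a", "b", "c"], ["b", "a"], ["b", "d"]]
fused = rrf_fuse(step_rankings)
print(fused)
```

Documents that appear in several steps accumulate score, which is why `b` (seen three times) outranks `a` (seen twice) in this example.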
Prompt To Fix All With AI
Fix the following 5 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 5
nemo_retriever/src/nemo_retriever/agentic/retrieval.py:140-146
`num_concurrent` has no validation in `__post_init__`, but `react_max_steps` and `text_truncation` do. A programmatic caller who passes `num_concurrent=0` will not get an error from the config; the failure surfaces later as a `ValueError: max_workers must be greater than 0` from `ThreadPoolExecutor`, with no indication of which config field caused it.
```suggestion
    def __post_init__(self) -> None:
        if not str(self.llm_model).strip():
            raise ValueError("Agentic retrieval requires a non-empty llm_model.")
        if int(self.react_max_steps) < 1:
            raise ValueError("react_max_steps must be >= 1.")
        if int(self.text_truncation) < 0:
            raise ValueError("text_truncation must be >= 0.")
        if int(self.num_concurrent) < 1:
            raise ValueError("num_concurrent must be >= 1.")
```
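
To make the difference concrete, the sketch below contrasts the two failure points. `DemoConfig` is a hypothetical stand-in with a single field, not the real `AgenticRetrievalConfig`; the `ThreadPoolExecutor` behavior is standard-library behavior.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class DemoConfig:
    """Hypothetical stand-in for the config, reduced to one field."""

    num_concurrent: int = 1

    def __post_init__(self) -> None:
        if int(self.num_concurrent) < 1:
            raise ValueError("num_concurrent must be >= 1.")


def early_failure_message() -> str:
    """Error raised at config construction; it names the offending field."""
    try:
        DemoConfig(num_concurrent=0)
    except ValueError as exc:
        return str(exc)
    return ""


def late_failure_message() -> str:
    """Error raised by the executor when the config did not validate."""
    try:
        ThreadPoolExecutor(max_workers=0)
    except ValueError as exc:
        return str(exc)
    return ""


print("early:", early_failure_message())
print("late: ", late_failure_message())
```

The early message points at `num_concurrent`, while the late one only mentions `max_workers`, which a caller never set directly.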
### Issue 2 of 5
nemo_retriever/src/nemo_retriever/pipeline/__main__.py:641
**Unbounded qrels dict logged at INFO**
`_qrels` is logged in full without any size cap. On a ViDoRe domain with 300+ queries each with multiple relevant docs, this generates a single INFO line that can exceed log-aggregator limits and makes the log unreadable. The `_run` dict immediately below already applies `[:10]` per query; `_qrels` should do the same, or both should be dropped to DEBUG.
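
One way to bound the log line is a small preview helper, sketched below. The helper name `preview_qrels` and the `max_queries` cap are assumptions added for illustration; only the per-query `[:10]` cap mirrors what the review says `_run` already does.

```python
import logging
from typing import Dict

logger = logging.getLogger(__name__)


def preview_qrels(
    qrels: Dict[str, Dict[str, int]],
    max_queries: int = 5,  # assumption: also cap how many queries are shown
    max_docs: int = 10,  # mirrors the [:10] per-query cap applied to _run
) -> Dict[str, Dict[str, int]]:
    """Return a size-bounded copy of qrels that is safe to log."""
    preview: Dict[str, Dict[str, int]] = {}
    for query_id in list(qrels)[:max_queries]:
        docs = qrels[query_id]
        preview[query_id] = dict(list(docs.items())[:max_docs])
    return preview


# Synthetic data: 300 queries with 30 relevant docs each.
qrels = {f"q{i}": {f"d{j}": 1 for j in range(30)} for i in range(300)}
logger.info("qrels: %d queries, preview=%s", len(qrels), preview_qrels(qrels))
```

Logging the total count next to the preview keeps the dataset size visible without printing every entry.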
### Issue 3 of 5
nemo_retriever/src/nemo_retriever/pipeline/__main__.py:849
**Agentic LLM always uses the general `remote_api_key`**
There is no `--agentic-api-key` CLI flag; the agentic LLM endpoint always receives `remote_api_key`. For deployments where the embedding endpoint and the agentic LLM endpoint are at different services that require different credentials, the wrong key will be sent to the LLM, resulting in an authentication error that gives no hint about which flag to set. Consider adding `--agentic-api-key` (defaulting to `remote_api_key` for backward compatibility) and resolving it the same way `agentic_invoke_url` is resolved from an env var.
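
A possible resolution order is sketched below, assuming an argparse-style CLI. Only `--agentic-api-key` and `remote_api_key` come from the review; the `--remote-api-key` flag spelling, the `AGENTIC_API_KEY` environment variable, and the function names are hypothetical.

```python
import argparse
import os
from typing import Mapping, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Assumed spelling of the existing general-purpose key flag.
    parser.add_argument("--remote-api-key", default=None)
    # Proposed override for the agentic LLM endpoint.
    parser.add_argument("--agentic-api-key", default=None)
    return parser


def resolve_agentic_api_key(
    args: argparse.Namespace,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Explicit flag first, then env var (hypothetical name), then remote_api_key."""
    return args.agentic_api_key or env.get("AGENTIC_API_KEY") or args.remote_api_key


args = build_parser().parse_args(["--remote-api-key", "embed-key"])
# With no override set anywhere, the general key is used (backward compatible).
print(resolve_agentic_api_key(args, env={}))
```

Falling back to `remote_api_key` last keeps single-credential deployments working unchanged.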
### Issue 4 of 5
nemo_retriever/src/nemo_retriever/graph/react_agent_operator.py:534-535
**Per-step INFO logging produces very high log volume**
Each ReAct iteration now emits at least 2-3 INFO lines (`begin seen_docs`, `finish_reason`, and `retrieve` or `final_results`). With `max_steps=50` (the default) and `num_concurrent=N` queries, a single evaluation call produces up to `50 x 3 x N` INFO lines. The step-level loop control logs (`begin`, `finish_reason`, `no tool call; requesting continuation`) are better suited to DEBUG since they carry no actionable information beyond the retrieval and `final_results` calls that already log at INFO.
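
The effect of the proposed level split can be sketched with the standard `logging` module. The loop below is a stand-in for the ReAct loop, not the operator's code; the logger name, `CountingHandler`, and the message strings are assumptions for illustration.

```python
import logging
from typing import Dict

logger = logging.getLogger("react_agent_demo")
logger.propagate = False  # keep the demo out of the root logger


class CountingHandler(logging.Handler):
    """Counts emitted records per level name."""

    def __init__(self) -> None:
        super().__init__()
        self.counts: Dict[str, int] = {}

    def emit(self, record: logging.LogRecord) -> None:
        self.counts[record.levelname] = self.counts.get(record.levelname, 0) + 1


def run_steps(max_steps: int) -> None:
    for step in range(max_steps):
        # Loop-control messages: useful when debugging, noisy otherwise.
        logger.debug("step %d: begin", step)
        logger.debug("step %d: finish_reason=%s", step, "tool_calls")
        # Retrieval calls stay at INFO.
        logger.info("step %d: retrieve", step)


def count_emitted(level: int, max_steps: int = 50) -> Dict[str, int]:
    handler = CountingHandler()
    logger.addHandler(handler)
    logger.setLevel(level)
    try:
        run_steps(max_steps)
    finally:
        logger.removeHandler(handler)
    return handler.counts


print(count_emitted(logging.INFO))
print(count_emitted(logging.DEBUG))
```

At INFO, each step emits one line instead of three; the loop-control lines remain available by switching the logger to DEBUG.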
### Issue 5 of 5
nemo_retriever/src/nemo_retriever/graph/react_agent_operator.py:13-21
**`_preview_text` / `_preview_doc_ids` duplicated in two modules**
Identical implementations of `_preview_text` (and near-identical `_preview_doc_ids`) are defined in both `react_agent_operator.py` and `selection_agent_operator.py` with the same module-level constants (`_LOG_PREVIEW_CHARS = 300`, `_LOG_DOC_ID_LIMIT = 20`). A shared `graph/_utils.py` would keep these in one place and prevent drift if the truncation limit ever needs adjusting.
Reviews (3): Last reviewed commit: "added review fixes"
Signed-off-by: Mahika Wason <mwason@nvidia.com>
Description
Agentic retrieval mode + BEIR / query-CSV evaluation
Summary
Adds an LLM-driven agentic retrieval strategy as an alternative to the single dense-retrieval pass, plus first-class evaluation for it (BEIR-style datasets and ad-hoc query CSVs). The change is additive: the standard retrieval path and outputs are unchanged; agentic mode reuses the existing `Retriever`/vector DB and is opt-in via `--retrieval-mode agentic`.
What's new
- `ReActAgentOperator` runs a per-query ReAct loop (issues retrieval sub-queries, accumulates candidates across steps, decides when to stop) → `RRFAggregatorOperator` fuses across steps (RRF, k=60) → `SelectionAgentOperator` does a final LLM selection, with a source-priority fallback chain (final_results → RRF → selection → candidate_ranking).
- `--evaluation-mode beir`: score against a registered benchmark: `vidore_hf` (needs `datasets`) plus CSV/JSON loaders; `recall@k` / `ndcg@k`.
- `--evaluation-mode recall`: score agentic retrieval against a query CSV (`query` + `golden_answer`), no dataset loader required (agentic-only; `pdf_page` / `pdf_only`).
- `--retrieval-mode`, `--agentic-llm-model`, `--agentic-invoke-url`, `--agentic-react-max-steps` (50), `--agentic-backend-top-k` (20), `--agentic-text-truncation` (0 = none), `--agentic-reasoning-effort` (high), `--agentic-num-concurrent` (1), and `--beir-loader` / `-dataset-name` / `-doc-id-field` / `-split` / `-query-language`.
- `agentic/README.md`; `test_agentic_eval.py` + `test_agentic_operators.py`.
Results: ViDoRe v3
Benchmarked against the reference agentic pipeline (`retrieval-bench`) under an identical, controlled setup so the comparison isolates the retrieval framework: same page-level image+text index (`llama-nemotron-embed-vl-1b-v2` embedder), same agent LLM (`llama-3.3-nemotron-super-49b-v1.5`), same agent settings (`reasoning_effort=high`, retriever pool depth 20, target top-k 10, max 50 ReAct steps), full query sets. The retrieval substrate is shared, so the numbers reflect the agent framework only.
> Both runs share the same index, embedder, agent LLM, reasoning_effort, and top-k; the only setting not pinned is agent-LLM sampling temperature (this PR uses greedy 0.0; the reference uses its endpoint default).
The graph-operator implementation tracks the reference pipeline across all eight domains on a shared substrate.
Scope
Checklist